TL;DR

Given the results of the analysis in fluidigm.Rmd, here I explore the role of the V intercept in the model.

One hypothesis is that this intercept acts as a size factor. Is it true? And if so, Will the data be explained completely by only the intercept and one factor W?

There are four possible models, if we keep X fixed: V present in both \(\mu\) and \(\pi\), only in \(\mu\), only in \(\pi\) and in neither.

\(\gamma_pi\) captures the detection rate, while \(\gamma_mu\) captures the coverage (sequencing depth). Since the second component of W is also highly correlated with coverage, the model can have either \(\gamma_mu\) or two components of W. We need to figure out if this is a general feature or if it’s this particular dataset that exhibit confounding between biology and coverage.

Data

Since we didn’t see much difference between the high and low coverage data, I will focus on the latter. I will not include any normalization / offset.

Select low coverage cells, filter out the genes that do not have at least 10 counts in at least 10 cells.

data("fluidigm")
fluidigm_low <- fluidigm[,which(colData(fluidigm)$Coverage_Type=="Low")]
filter <- apply(assay(fluidigm_low)>10, 1, sum)>=10

Number of retained genes:

print(sum(filter))
## [1] 2506
fluidigm_low <- fluidigm_low[filter,]
low <- assay(fluidigm_low)

First, let’s look at PCA (of the log counts) for reference.

bio <- as.factor(colData(fluidigm_low)$Biological_Condition)
cl <- as.factor(colData(fluidigm_low)$Cluster2)

detection_rate <- colSums(low>0)
coverage <- colSums(low)

pca <- prcomp(t(log1p(low)))
plot(pca$x, col=cols[bio], pch=19, main="PCA of log-counts, centered not scaled")
legend("topleft", levels(bio), fill=cols)

V intercept in both \(\mu\) and \(\pi\)

print(system.time(zinb_Vall <- zinbFit(low, ncores = 3, K = 2)))
## Warning in simpute.als(x, J, thresh, lambda, maxit, trace.it, warm.start, :
## Convergence not achieved by 100 iterations
##    user  system elapsed 
## 409.710  66.880 182.534

Plot the results with cells colored according to their biological condition.

plot(zinb_Vall@W, col=cols[bio], pch=19, xlab="W1", ylab="W2", main="Both V intercepts")
legend("topleft", levels(bio), fill=cols)

Explore \(\gamma_mu\) and \(\gamma_pi\)

One interpretation of the model is that \(\gamma_mu\) will act as a “size factor”. Here, we check whether it captures sequencing depth and/or detection rate.

#total number of detected genes in the cell
df <- data.frame(gamma_mu = zinb_Vall@gamma_mu[1,], gamma_pi = zinb_Vall@gamma_pi[1,], detection_rate=detection_rate, coverage=coverage)
pairs(df, pch=19, col=cols[bio])

print(cor(df, method="spearman"))
##                  gamma_mu   gamma_pi detection_rate   coverage
## gamma_mu        1.0000000  0.8181818     -0.7086552  0.9677885
## gamma_pi        0.8181818  1.0000000     -0.9484577  0.7143357
## detection_rate -0.7086552 -0.9484577      1.0000000 -0.6348329
## coverage        0.9677885  0.7143357     -0.6348329  1.0000000

\(\gamma_pi\) clearly captures the detection rate and \(\gamma_mu\) clearly captures the coverage. Interestingly, the two estimates are positively correlated.

par(mfrow=c(1, 2))
plot(rowMeans(getPi(zinb_Vall)), detection_rate, xlab="Average estimated Pi", ylab="detection rate", pch=19, col=cols[bio])

plot(rowMeans(log1p(getMu(zinb_Vall))), coverage, xlab="Average estimated log Mu", ylab="coverage", pch=19, col=cols[bio])

It is worth looking at the other estimates as well, namely, \(\beta\) and \(\alpha\).

par(mfrow=c(2, 2))
boxplot(zinb_Vall@beta_mu[1,], main="Beta_Mu")
boxplot(zinb_Vall@beta_pi[1,], main="Beta_Pi")

plot(t(zinb_Vall@alpha_mu), xlab="Alpha_Mu 1", ylab="Alpha_Mu 2", main="Alpha_Mu")
plot(t(zinb_Vall@alpha_pi), xlab="Alpha_Pi 1", ylab="Alpha_Pi 2", main="Alpha_Pi")

It seems that only a few genes drive the estimate of W.

One possibility is that, once we remove coverage, the data can be explained by only one factor. If that’s the case, plotting \(\gamma_mu\) vs. \(W_2\) should recover the PCA plot.

plot(zinb_Vall@gamma_mu[1,], zinb_Vall@W[,2], col=cols[bio], pch=19, xlab="Gamma_Mu", ylab="W2", main="Both V intercepts")
legend("bottomright", levels(bio), fill=cols)

plot(zinb_Vall@gamma_pi[1,], zinb_Vall@W[,2], col=cols[bio], pch=19, xlab="Gamma_Pi", ylab="W2", main="Both V intercepts")
legend("bottomright", levels(bio), fill=cols)

No V intercept

print(system.time(zinb_Vnone <- zinbFit(low, ncores = 3, K = 2, V=matrix(0, ncol=1, nrow=NROW(low)))))
##    user  system elapsed 
## 156.161  27.808  69.145
plot(zinb_Vnone@W, col=cols[bio], pch=19, xlab="W1", ylab="W2")
legend("bottomright", levels(bio), fill=cols)

df <- data.frame(W1 = zinb_Vnone@W[,1], W2 = zinb_Vnone@W[,2], detection_rate=detection_rate, coverage=coverage)
pairs(df, pch=19, col=cols[bio])

print(cor(df, method="spearman"))
##                         W1          W2 detection_rate   coverage
## W1              1.00000000 -0.02478147      0.9038976 -0.6357955
## W2             -0.02478147  1.00000000      0.1709846 -0.6683566
## detection_rate  0.90389764  0.17098463      1.0000000 -0.6348329
## coverage       -0.63579545 -0.66835664     -0.6348329  1.0000000
par(mfrow=c(1, 2))
plot(rowMeans(getPi(zinb_Vnone)), detection_rate, xlab="Average estimated Pi", ylab="detection rate", pch=19, col=cols[bio])

plot(rowMeans(log1p(getMu(zinb_Vnone))), coverage, xlab="Average estimated log Mu", ylab="coverage", pch=19, col=cols[bio])

It is worth looking at the other estimates as well, namely, \(\beta\) and \(\alpha\).

par(mfrow=c(2, 2))
boxplot(zinb_Vnone@beta_mu[1,], main="Beta_Mu")
boxplot(zinb_Vnone@beta_pi[1,], main="Beta_Pi")

plot(t(zinb_Vnone@alpha_mu), xlab="Alpha_Mu 1", ylab="Alpha_Mu 2", main="Alpha_Mu")
plot(t(zinb_Vnone@alpha_pi), xlab="Alpha_Pi 1", ylab="Alpha_Pi 2", main="Alpha_Pi")

V intercept only in \(\mu\)

V <- cbind(rep(0, NROW(low)), rep(1, NROW(low)))

print(system.time(zinb_Vmu <- zinbFit(low, ncores = 3, K = 2, V=V, which_V_mu=2L, which_V_pi=1L)))
## Warning in simpute.als(x, J, thresh, lambda, maxit, trace.it, warm.start, :
## Convergence not achieved by 100 iterations
## Warning in optimize(f = zinb.loglik.dispersion, Y = Y, mu = mu, logitPi =
## logitPi, : NA/Inf replaced by maximum positive value
##    user  system elapsed 
## 464.293  70.466 203.342
plot(zinb_Vmu@W, col=cols[bio], pch=19, xlab="W1", ylab="W2")
legend("topleft", levels(bio), fill=cols)

df <- data.frame(W1 = zinb_Vmu@W[,1], W2 = zinb_Vmu@W[,2], gamma_mu = zinb_Vmu@gamma_mu[1,], detection_rate=detection_rate, coverage=coverage)
pairs(df, pch=19, col=cols[bio])

print(cor(df, method="spearman"))
##                        W1         W2   gamma_mu detection_rate   coverage
## W1              1.0000000 -0.5879808 -0.5746066      0.8053149 -0.5478584
## W2             -0.5879808  1.0000000  0.3079108     -0.6664554  0.2897727
## gamma_mu       -0.5746066  0.3079108  1.0000000     -0.6754155  0.9843531
## detection_rate  0.8053149 -0.6664554 -0.6754155      1.0000000 -0.6348329
## coverage       -0.5478584  0.2897727  0.9843531     -0.6348329  1.0000000
par(mfrow=c(1, 2))
plot(rowMeans(getPi(zinb_Vmu)), detection_rate, xlab="Average estimated Pi", ylab="detection rate", pch=19, col=cols[bio])

plot(rowMeans(log1p(getMu(zinb_Vmu))), coverage, xlab="Average estimated log Mu", ylab="coverage", pch=19, col=cols[bio])

It is worth looking at the other estimates as well, namely, \(\beta\) and \(\alpha\).

par(mfrow=c(2, 2))
boxplot(zinb_Vmu@beta_mu[1,], main="Beta_Mu")
boxplot(zinb_Vmu@beta_pi[1,], main="Beta_Pi")

plot(t(zinb_Vmu@alpha_mu), xlab="Alpha_Mu 1", ylab="Alpha_Mu 2", main="Alpha_Mu")
plot(t(zinb_Vmu@alpha_pi), xlab="Alpha_Pi 1", ylab="Alpha_Pi 2", main="Alpha_Pi")

V intercept only in \(\pi\)

print(system.time(zinb_Vpi <- zinbFit(low, ncores = 3, K = 2, V=V, which_V_mu=1L, which_V_pi=2L)))
##    user  system elapsed 
## 129.919  21.942  58.287
plot(zinb_Vpi@W, col=cols[bio], pch=19, xlab="W1", ylab="W2")
legend("bottomright", levels(bio), fill=cols)

df <- data.frame(W1 = zinb_Vpi@W[,1], W2 = zinb_Vpi@W[,2], gamma_pi = zinb_Vpi@gamma_pi[1,], detection_rate=detection_rate, coverage=coverage)
pairs(df, pch=19, col=cols[bio])

print(cor(df, method="spearman"))
##                         W1          W2   gamma_pi detection_rate
## W1              1.00000000  0.01088287 -0.7012675      0.7671580
## W2              0.01088287  1.00000000 -0.3999563      0.3585782
## gamma_pi       -0.70126748 -0.39995629  1.0000000     -0.9832927
## detection_rate  0.76715801  0.35857819 -0.9832927      1.0000000
## coverage       -0.51459790 -0.80729895  0.6306818     -0.6348329
##                  coverage
## W1             -0.5145979
## W2             -0.8072990
## gamma_pi        0.6306818
## detection_rate -0.6348329
## coverage        1.0000000
par(mfrow=c(1, 2))
plot(rowMeans(getPi(zinb_Vpi)), detection_rate, xlab="Average estimated Pi", ylab="detection rate", pch=19, col=cols[bio])

plot(rowMeans(log1p(getMu(zinb_Vpi))), coverage, xlab="Average estimated log Mu", ylab="coverage", pch=19, col=cols[bio])

It is worth looking at the other estimates as well, namely, \(\beta\) and \(\alpha\).

par(mfrow=c(2, 2))
boxplot(zinb_Vpi@beta_mu[1,], main="Beta_Mu")
boxplot(zinb_Vpi@beta_pi[1,], main="Beta_Pi")

plot(t(zinb_Vpi@alpha_mu), xlab="Alpha_Mu 1", ylab="Alpha_Mu 2", main="Alpha_Mu")
plot(t(zinb_Vpi@alpha_pi), xlab="Alpha_Pi 1", ylab="Alpha_Pi 2", main="Alpha_Pi")

Are \(\gamma_mu\) and \(\gamma_pi\) enough to explain the data?

Here, I try to fit a model without W, to see if detection rate and coverage are enough to completely explain the data.

print(system.time(zinb_noW <- zinbFit(low, ncores = 3)))
##    user  system elapsed 
##  74.157  19.171  38.160
plot(zinb_noW@gamma_mu, zinb_noW@gamma_pi, col=cols[bio], pch=19, xlab="Gamma_Mu", ylab="Gamma_Pi")
legend("bottomright", levels(bio), fill=cols)

print(cor(zinb_noW@gamma_mu[1,], zinb_noW@gamma_pi[1,], method="spearman"))
## [1] 0.8332605